Mentor: A Visualization and Quality Assurance Framework for Crowd-Sourced Data Generation

Authors

  • Siamak Faridani
  • Georg Buscher
  • Johnny Ferguson
Abstract

Crowdsourcing is a feasible method for collecting labeled datasets for training and evaluating machine learning models. Compared to the expensive process of generating labeled datasets using dedicated trained judges, the low cost of data generation in crowdsourcing environments enables researchers and practitioners to collect significantly larger amounts of data for the same cost. However, crowdsourcing is prone to noise and, without proper quality assurance processes in place, may generate low-quality data that is of limited value. In this paper we propose a human-in-the-loop approach to deal with quality assurance (QA) in crowdsourcing environments. We contribute various visualization methods and statistical tools that can be used to identify defective or fraudulent data and unreliable judges. Based on these tools and principles we have built a system called Mentor for conducting QA for datasets used in a large commercial search engine. We describe various tools from Mentor and demonstrate their effectiveness through real cases for generating training and test data for search engine caption generation. Our conclusions and the tools described are generalizable and applicable to processes that collect categorical and ordinal discrete datasets for machine learning.

Similar resources

Pair Me Up: A Web Framework for Crowd-Sourced Spoken Dialogue Collection

We describe and analyze a new web-based spoken dialogue data collection framework. The framework enables the capture of conversational speech from two remote users who converse with each other and play a dialogue game entirely through their web browsers. We report on the substantial improvements in the speed and cost of data capture we have observed with this crowd-sourced paradigm. We also ana...

Improving the Accuracy of IVC Simulation Using Crowd-sourced Geodata

We discuss the use of crowd-sourced geodata in simulative evaluations of Inter-Vehicle Communication (IVC) protocol designs. Typically, network simulation tools, which have been improved over decades of network research, are used for evaluating communication systems. In the area of IVC, however, additional challenges have to be met. Most important, the mobility of vehicles in network simulation...

KnowledgeWiki: An OpenSource Tool for Creating Community-Curated Vocabulary, with a Use Case in Materials Science

Resource Description Framework (RDF) datasets can be created by transforming structured databases, extracting the triples from semi-structured and unstructured sources, crowd-sourcing, or by integrating the existing datasets. The reliability and quality of these datasets can be improved by the participation of domain experts via a special purpose tool or a crowd-sourced application. Wikidata an...

3D Reconstruction of Dynamic Textures in Crowd Sourced Data

We propose a framework to automatically build 3D models for scenes containing structures not amenable for photo-consistency based reconstruction due to having dynamic appearance. We analyze the dynamic appearance elements of a given scene by leveraging the imagery contained in Internet image photo-collections and online video sharing websites. Our approach combines large scale crowd sourced SfM...

Can the Crowd be Controlled?: A Case Study on Crowd Sourcing and Automatic Validation of Completed Tasks based on User Modeling

Annotation is an essential step in the development cycle of many Natural Language Processing (NLP) systems. Lately, crowdsourcing has been employed to facilitate large scale annotation at a reduced cost. Unfortunately, verifying the quality of the submitted annotations is a daunting task. Existing approaches address this problem either through sampling or redundancy. However, these approaches d...

Publication date: 2013